
Rank Methods #1733

Merged: 8 commits into pydata:master from 0x0L's rank_ffill branch, Dec 18, 2017

Conversation

0x0L
Contributor

@0x0L 0x0L commented Nov 21, 2017

  • Closes Rank function #1731
  • Tests added / passed
  • Passes git diff upstream/master **/*py | flake8 --diff
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

@0x0L 0x0L force-pushed the rank_ffill branch 6 times, most recently from a00b098 to c8997c7 Compare November 21, 2017 22:39
@0x0L 0x0L changed the title [WIP] initial support for rank and ffill [WIP] initial support for rank Nov 21, 2017
@shoyer
Member

shoyer commented Nov 23, 2017

Thanks for putting this together!

Rather than using the injection stuff in ops.py (which we would like to eliminate), I would prefer to simply implement this as a method on Dataset and DataArray.

Also take a look at the use of apply_ufunc in #1640 for a straightforward example of wrapping external functions. I think that would make things easier here.
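The `apply_ufunc` pattern shoyer points to can be sketched roughly like this. This is a minimal illustration, not the PR's actual code: `_rank_last_axis` is a hypothetical plain-NumPy stand-in for `bottleneck.rankdata`, so the sketch has no optional dependency.

```python
import numpy as np
import xarray as xr

def _rank_last_axis(a):
    # 1-based ordinal ranks along the last axis; a NumPy stand-in
    # for bottleneck.rankdata.
    order = np.argsort(a, axis=-1)
    ranks = np.empty(a.shape, dtype=np.float64)
    np.put_along_axis(ranks, order,
                      np.arange(1, a.shape[-1] + 1, dtype=np.float64),
                      axis=-1)
    return ranks

def rank(da, dim):
    # apply_ufunc moves `dim` to the last axis (input_core_dims), applies
    # the wrapped function along it, and broadcasts over every other dim.
    return xr.apply_ufunc(
        _rank_last_axis, da,
        input_core_dims=[[dim]],
        output_core_dims=[[dim]],
    )
```

For example, `rank(xr.DataArray([30., 10., 20.], dims='x'), 'x')` yields ranks `[3., 1., 2.]`.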

@@ -23,6 +23,8 @@
try:
import bottleneck as bn
has_bottleneck = True
# monkeypatch numpy with rankdata
np.rankdata = bn.rankdata
Member

Please don't monkeypatch -- this would affect NumPy for all xarray users.

@0x0L
Contributor Author

0x0L commented Nov 24, 2017

@shoyer Thanks for the comment and the tips! I'll make the changes asap, hopefully this week-end

@0x0L
Contributor Author

0x0L commented Nov 24, 2017

Dropped support for Dataset. I don't think there's much use for it anyway.

@0x0L 0x0L force-pushed the rank_ffill branch 2 times, most recently from f9c0bc5 to 94ac0ca Compare November 29, 2017 22:52
array([ 1., 2., 3.])
Dimensions without coordinates: x
"""
import bottleneck as bn
Contributor Author

IMO the exception is explicit enough

ranks = apply_ufunc(func, self,
dask='parallelized',
keep_attrs=True,
output_dtypes=[np.float_],
Member

Based on the docs, it looks like the return type is always np.float64.

Ranks begin at 1, not 0. If pct is True, computes percentage ranks.

NaNs in the input array are returned as NaNs.

Member

Add a note here that this requires bottleneck.

# str
y = DataArray(['c', 'b', 'a'])
self.assertDataArrayEqual(y.rank('dim_0'), x)

Member

need to add test coverage for the pct=True option.

Member

@shoyer shoyer left a comment

This looks great -- can you take a look at implementing this for Dataset, too? For consistency, it's nice if we have the same methods for both Dataset/DataArray if possible. I think this should be a pretty straightforward use of apply:

def rank(self, dim, pct=False):
    return self.apply(lambda array: array.rank(dim, pct=pct))

@0x0L
Contributor Author

0x0L commented Dec 4, 2017

@shoyer when the ranking dimension is missing from an array in the dataset, should we do nothing, or put 1s everywhere? Neither option looks appealing or natural.

@shoyer
Member

shoyer commented Dec 4, 2017

when the ranking dimension is missing from an array in the dataset, should we do nothing, or put 1s everywhere?

Good point, maybe it needs to be slightly more complicated than apply.

Two other options are to raise an error (safest) or to drop all such variables from the result. The latter is similar to what we do in aggregations like sum() when an argument is non-numeric. I think this is probably the best choice: it strikes a good balance between not returning an invalid result and not raising errors so often as to be annoying.

@0x0L
Contributor Author

0x0L commented Dec 4, 2017

Should I move the implementation to Variable and use temporary datasets like quantile does?

@shoyer
Member

shoyer commented Dec 4, 2017

Should I move the implementation to Variable and use temporary datasets like quantile does?

I would be happy either way here. In some ways, yes, that would be the cleanest solution.

kwargs=dict(axis=axis)).transpose(*self.dims)
if not pct:
return ranks
return (ranks - 1) / (self.count(dim) - 1)
Contributor Author

@shoyer with pct=True this is not the same implementation as pandas, which returns rank / count. I think this one makes more sense, but users might be surprised.

Member

Good catch, thanks for calling this out. The pandas definition makes a little more sense to me, as "fraction of the data with this rank or higher". I don't know how I would describe your version in a simple sentence.

Contributor Author

@0x0L 0x0L Dec 5, 2017

Normalized? It lies in [0, 1], not [1/Nsamples, 1].

My main issue is that I would expect the pct rank to average to 0.5, but with pandas we have

In [2]: pd.Series([1, 2, 3]).rank(pct=True)
Out[2]: 
0    0.333333
1    0.666667
2    1.000000

On the other hand, with my implementation, pct rank is not defined for a dim of length 1.

For the record, bottleneck's move_rank returns a polar rank in [-1, 1].
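For concreteness, the three normalizations under discussion differ only in an affine rescaling of the ordinal ranks. A small numeric sketch (variable names are illustrative, not from the PR):

```python
import numpy as np

ranks = np.array([1.0, 2.0, 3.0])        # ordinal ranks of a length-3 series
n = ranks.size

pandas_pct = ranks / n                   # pandas: in (0, 1], here [1/3, 2/3, 1]
unit_norm = (ranks - 1) / (n - 1)        # this PR's draft: in [0, 1], mean 0.5
polar = 2 * (ranks - 1) / (n - 1) - 1    # bottleneck move_rank style: in [-1, 1]
```

Note that `unit_norm` and `polar` are undefined when n == 1, which is the drawback 0x0L mentions.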

Member

My main issue is that I would expect the pct rank to average to 0.5

I agree that this would be a nice property, but I'm not sure it's worth the loss in interpretability.

Contributor Author

How about a str arg for specifying the normalization scheme? 'pct', 'norm', and 'polar'.

Member

What would polar mean?

All things being equal, we do try to lean towards doing what pandas does in xarray because consistency makes it easier to use both packages together.

Contributor Author

polar would be [-1, 1].
I get your point of view; I'll make the change to mimic pandas.

----------
dim : str
pct : bool, optional
keep_attrs : bool, optional
Member

One last comment: it would be great if you could expand the parameters section of the docstrings to include a one-line description of each. For example:

         Parameters
         ----------
         dim : str
             Dimension(s) over which to compute rank.
         pct : bool, optional
             If True, compute percentage ranks, otherwise compute integer ranks.
         keep_attrs : bool, optional
             If True, the dataset's attributes (`attrs`) will be copied from
             the original object to the new one.  If False (default), the new
             object will be returned without attributes.

@jhamman jhamman changed the title [WIP] initial support for rank Rank Methods Dec 7, 2017
@max-sixty
Collaborator

This looks great @0x0L, thanks and congrats on your first contribution!

# int
v = Variable(['x'], [3,2,1])
expect = bn.rankdata(v.data, axis=0)
np.testing.assert_allclose(v.rank('x').values, expect)
Collaborator

@max-sixty max-sixty Dec 8, 2017

FYI for the future, xarray.testing has these natively, e.g. assert_equal rather than the older self. methods
(tbc, no need to change)

Contributor Author

A shameless copy/paste from the test above :)

Member

@shoyer shoyer left a comment

A couple of minor points.

for name, var in iteritems(self.variables):
if name in self.data_vars and dim in var.dims:
variables[name] = var.rank(dim, pct=pct)
variables.update({
Member

This looks a little complicated to me, and also will drop non-dimension coordinates. As a simpler rule, how about simply keeping all existing coordinates?

This could look like:

if name in self.data_vars:
    if dim in var.dims:
        variables[name] = var.rank(dim, pct=pct)
else:
    variables[name] = var
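The rule shoyer suggests above (rank the data variables that depend on `dim`, drop the rest, keep all coordinates) can be sketched as a free function. This is an illustration only, not the PR's code: `_ordinal_rank` is a hypothetical plain-NumPy stand-in for the bottleneck-backed `Variable.rank`, and the pct branch uses the pandas-style rank/count normalization agreed on above (ignoring NaNs for simplicity).

```python
import numpy as np
import xarray as xr

def _ordinal_rank(da, dim):
    # 1-based ordinal ranks of a DataArray along `dim` (NumPy stand-in).
    axis = da.dims.index(dim)
    a = np.moveaxis(da.values, axis, -1)
    order = np.argsort(a, axis=-1)
    ranks = np.empty(a.shape, dtype=np.float64)
    np.put_along_axis(ranks, order,
                      np.arange(1, a.shape[-1] + 1, dtype=np.float64),
                      axis=-1)
    return da.copy(data=np.moveaxis(ranks, -1, axis))

def dataset_rank(ds, dim, pct=False):
    ranked = {}
    for name, da in ds.data_vars.items():
        if dim in da.dims:  # rank only variables that depend on `dim`
            r = _ordinal_rank(da, dim)
            ranked[name] = r / r.sizes[dim] if pct else r
        # data variables without `dim` are silently dropped, as with sum()
    # passing ds.coords keeps all coordinates, including non-dimension ones
    return xr.Dataset(ranked, coords=ds.coords, attrs=ds.attrs)
```

With a dataset holding `a` on dim `x` and `b` on dim `y`, `dataset_rank(ds, 'x')` ranks `a`, drops `b`, and keeps the coordinates.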

Member

It would also be good to add test cases verifying:

  1. that coordinates stick around
  2. that invalid data variables are dropped

import bottleneck as bn

if isinstance(self.data, dask_array_type):
raise TypeError("rank does not work for arrays stored as dask "
Member

please add a test that this error is raised, e.g., using pytest.raises or raises_regex
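As a generic illustration of the pytest.raises pattern requested here (`fake_rank` is a hypothetical stand-in, not the PR's code, since exercising the real check would need a dask-backed array):

```python
import pytest

def fake_rank(data, use_dask=False):
    # hypothetical stand-in: the real Variable.rank raises TypeError
    # when its underlying data is a dask array
    if use_dask:
        raise TypeError("rank does not work for arrays stored as dask arrays")
    return [sorted(data).index(v) + 1 for v in data]  # 1-based ordinal ranks

def test_rank_raises_on_dask():
    # pytest.raises(..., match=...) checks both the exception type
    # and (via regex) the error message, like xarray's raises_regex helper
    with pytest.raises(TypeError, match="dask"):
        fake_rank([3, 1, 2], use_dask=True)
```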

Variables that do not depend on `dim` are dropped.
"""
if dim not in self.dims:
raise ValueError('Dataset does not contain the dimension: %s' % dim)
Member

please add test coverage for this condition

Member

@shoyer shoyer left a comment

Looks good to me! I'll merge this in a day or two unless anyone has other comments/suggestions.

else:
variables[name] = var

coord_names = set(k for k in self.coords if k in variables)
Member

could be simplified as just coord_names = set(self.coords) now

@shoyer shoyer merged commit a0ef2b7 into pydata:master Dec 18, 2017
@shoyer
Member

shoyer commented Dec 18, 2017

Thank you!
